This report explores a dataset containing quality and attributes for approximately 1600 bottles of red wine.

Univariate Plots Section

The numbers of rows and columns in the data are

## [1] 1599
## [1] 13
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality.factor
##  Min.   : 8.40   Min.   :3.000   3: 10         
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53         
##  Median :10.20   Median :6.000   5:681         
##  Mean   :10.42   Mean   :5.636   6:638         
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199         
##  Max.   :14.90   Max.   :8.000   8: 18

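The dataset has 1,599 observations of 13 original variables (including the row index X). The 14th column in the structure output, quality.factor, is a factor version of quality.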

Most red wines have a quality score of 5 or 6. Why do most wines fall into these two scores? What makes a wine better? Are the very poorly or very highly rated wines outliers? I wonder what this distribution looks like across the other variables.
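The share of wines scored 5 or 6 can be checked from the quality.factor counts printed in the summary above. The following is a minimal sketch in R; the names quality_counts, quality_share and share_5_or_6 are introduced here for illustration and are not part of the original analysis.

```r
# Counts per quality score, copied from the quality.factor column of the summary above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Proportion of wines at each score
quality_share <- quality_counts / sum(quality_counts)
print(round(quality_share, 3))

# Combined share of the two most common scores (5 and 6)
share_5_or_6 <- sum(quality_share[c("5", "6")])
print(round(share_5_or_6, 3))
```

Under these counts, roughly 82% of the wines sit at a score of 5 or 6, which is why the extreme scores are so thinly populated.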

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

As a result,

  • Fixed acidity is right-skewed, with most wines between 6 and 8 g/dm^3.
  • Volatile acidity is slightly right-skewed, with most wines between 0.2 and 0.8 g/dm^3.
  • According to Wikipedia, citric acid is used less often than other acids, so I left the many 0 values in citric.acid as they are.
  • Residual sugar is right-skewed, with most wines between 2 and 4 g/dm^3.
  • Chlorides is right-skewed, with most wines between 0.0 and 0.1 g/dm^3.
  • Free sulfur dioxide is right-skewed, with most wines between 0 and 20 mg/dm^3.
  • Total sulfur dioxide is right-skewed, with most wines between 0 and 50 mg/dm^3.
  • Density is approximately normally distributed, with most wines near 0.997 g/cm^3.
  • pH is approximately normally distributed, with most wines near 3.3.
  • Alcohol is right-skewed, with most wines between 9 and 10%.
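One rough way to cross-check the skew readings against the printed summaries is to compare each mean with its median. The sketch below is a heuristic added for illustration, not the method used in the report: the column name skew_hint and the scaling by the interquartile range are assumptions, while the numbers are copied from the summary output above.

```r
# Quartiles, medians and means copied from the summary output above
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides",
               "density", "pH", "alcohol"),
  q1       = c(7.10, 1.900, 0.07000, 0.9956, 3.210, 9.50),
  median   = c(7.90, 2.200, 0.07900, 0.9968, 3.310, 10.20),
  mean     = c(8.32, 2.539, 0.08747, 0.9967, 3.311, 10.42),
  q3       = c(9.20, 2.600, 0.09000, 0.9978, 3.400, 11.10)
)

# Gap between mean and median, scaled by the interquartile range.
# Clearly positive values hint at a right tail; values near 0 hint at symmetry.
stats$skew_hint <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Most right-skewed variables first
print(stats[order(-stats$skew_hint), c("variable", "skew_hint")])
```

With these inputs, residual.sugar and chlorides show the largest positive gaps, while density and pH stay close to zero, in line with the bullets above.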

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1400  0.2800  0.2954  0.4400  1.0000

I omitted the 0 values from citric.acid and plotted it again. There seem to be three peaks, at 0.02, 0.24 and 0.49.
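Dropping the zeros before summarising can be sketched as follows. This is an illustration only: it uses the first ten citric.acid values printed by str() above rather than the full column, and the vector names are hypothetical.

```r
# First ten citric.acid values, as printed by str() above (illustration only)
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Keep only wines with a non-zero citric acid reading
citric_nonzero <- citric_acid[citric_acid > 0]

print(summary(citric_nonzero))
print(length(citric_nonzero))  # number of values that survive the filter
```

On the full data the same filter would be applied to the whole citric.acid column before plotting.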

For the right-skewed, long-tailed features, I applied a log10 transformation to better understand their distributions. The summaries before and after the transformation are shown below.

## wineQualityReds$residual.sugar
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## log10(wineQualityReds$residual.sugar)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.04576  0.27875  0.34242  0.36925  0.41497  1.19033

## wineQualityReds$chlorides
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## log10(wineQualityReds$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.921  -1.155  -1.102  -1.088  -1.046  -0.214

## wineQualityReds$total.sulfur.dioxide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## log10(wineQualityReds$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7782  1.3424  1.5798  1.5638  1.7924  2.4609

## wineQualityReds$alcohol
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## log10(wineQualityReds$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9243  0.9777  1.0086  1.0158  1.0453  1.1732

After log transformation,

  • The distribution of residual.sugar peaks between 2 and 3 g/dm^3.
  • The distribution of chlorides peaks around 0.08 g/dm^3.
  • The distribution of total.sulfur.dioxide peaks around 50 mg/dm^3.
  • The distribution of alcohol peaks between 9 and 10% by volume.
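The before and after summaries can be related with a small worked check. Because log10 is monotone, taking log10 of the reported quantiles of residual.sugar should reproduce the quantiles reported for log10(residual.sugar), whereas the mean does not carry over in the same way. A minimal sketch, with values copied from the output above and variable names chosen here for illustration:

```r
# Five-number summary of residual.sugar, copied from the output above
sugar_quantiles <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)
sugar_mean <- 2.539

# Mean of log10(residual.sugar), also copied from the output above
log_mean_reported <- 0.36925

# log10 of each quantile; compare with the log10 summary printed above
print(round(log10(sugar_quantiles), 5))

# The mean behaves differently: log10(mean) is not mean(log10)
print(round(log10(sugar_mean), 5))
print(log_mean_reported)
```

The gap between log10 of the mean and the mean of the logs reflects the long right tail: a few very sweet wines pull the raw mean up much more than they pull the log-scale mean.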

Total acidity is slightly right-skewed, with most wines around 7.5 g/dm^3.

I also created 3 categories divided by quality score:

  • low: quality score 0 - 4
  • middle: quality score 5 - 6
  • high: quality score 7 - 10

This should give a rough picture of how the “quality” feature behaves.

The number of wines in each quality.cut category is shown below.

##    low middle   high 
##     63   1319    217

With this categorization, most of the red wines (1319 out of 1599) fall into the “middle” category.
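The three-way grouping can be sketched with cut(). The example below is an assumption about how such a variable could be built, not the report's actual code: it rebuilds the quality scores from the per-score counts in the summary above, and quality_cut is an illustrative name standing in for quality.cut.

```r
# Rebuild the quality scores from the per-score counts in the summary above
# (row order is not preserved, which does not matter for counting)
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Right-closed bins: (0,4] -> low, (4,6] -> middle, (6,10] -> high
quality_cut <- cut(quality,
                   breaks = c(0, 4, 6, 10),
                   labels = c("low", "middle", "high"))

print(table(quality_cut))
```

Because cut() uses right-closed intervals by default, a score of 4 lands in "low" and a score of 6 in "middle", matching the category definitions above.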

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in the dataset with 13 attributes (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). Quality takes integer values from 3 to 8 and is also stored as a factor (quality.factor) with six levels.

Other observations:

  • Most red wines have a quality score of 5 or 6.
  • Fixed acidity is right-skewed, with most wines between 6 and 8 g/dm^3.
  • Volatile acidity is slightly right-skewed, with most wines between 0.2 and 0.8 g/dm^3.
  • Citric acid has three peaks, at 0.02, 0.24 and 0.49, once the 0 values are removed.
  • Residual sugar is right-skewed, with most wines between 2 and 4 g/dm^3.
  • Chlorides is right-skewed, with most wines between 0.0 and 0.1 g/dm^3.
  • Free sulfur dioxide is right-skewed, with most wines between 0 and 20 mg/dm^3.
  • Total sulfur dioxide is right-skewed, with most wines between 0 and 50 mg/dm^3.
  • Density is approximately normally distributed, with most wines near 0.997 g/cm^3.
  • pH is approximately normally distributed, with most wines near 3.3.
  • Alcohol is right-skewed, with most wines between 9 and 10%.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality of the red wines and its relation to the other attributes. I would like to identify the main factors that determine wine quality, and I hope to build a predictive model of quality from some of the variables in the dataset.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The other features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol) are all likely to affect wine quality. After reading some articles online, I suspect that volatile acidity and sulphates lower wine quality.

Did you create any new variables from existing variables in the dataset?

I created a new variable, ‘total acidity’, by adding fixed acidity, volatile acidity and citric acid, because total acidity is one of the important factors in evaluating wine quality, in balance with sweetness and bitterness. In contrast to pH, which measures the strength of acidity, total acidity is the total amount of all acids present.
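The definition of total acidity can be illustrated on the first four wines from the str() output above. This is a minimal sketch: the small data frame wines is built here only for the example, and the column name total.acidity follows the name that appears in the correlation table.

```r
# First four wines, values copied from the str() output above
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Total acidity as the sum of the three acid measurements (all in g/dm^3)
wines$total.acidity <- with(wines, fixed.acidity + volatile.acidity + citric.acid)

print(wines)
```

Since fixed acidity is by far the largest of the three terms, the new variable tracks it closely, which is consistent with the 0.995 correlation between fixed.acidity and total.acidity in the correlation table.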

I also created the “quality.cut” categorization, which divides wine quality into three groups: “low”, “middle” and “high”. There are 63, 1319 and 217 wines in the respective categories. I hope this categorization helps to show the rough tendencies of good and poor wines.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the right-skewed distributions: residual.sugar, chlorides, total.sulfur.dioxide and alcohol.

Observations after log transformation:

  • The distribution of residual.sugar peaks between 2 and 3 g/dm^3.
  • The distribution of chlorides peaks around 0.08 g/dm^3.
  • The distribution of total.sulfur.dioxide peaks around 50 mg/dm^3.
  • The distribution of alcohol peaks between 9 and 10% by volume.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
## total.acidity           0.99482800     -0.156620601  0.62825187
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
## total.acidity           0.117473729  0.102183639        -0.158241719
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## total.acidity                 -0.10760684  0.68488647 -0.67314051
##                         sulphates     alcohol     quality total.acidity
## fixed.acidity         0.183005664 -0.06166827  0.12405165    0.99482800
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778   -0.15662060
## citric.acid           0.312770044  0.10990325  0.22637251    0.62825187
## residual.sugar        0.005527121  0.04207544  0.01373164    0.11747373
## chlorides             0.371260481 -0.22114054 -0.12890656    0.10218364
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606   -0.15824172
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029   -0.10760684
## density               0.148506412 -0.49617977 -0.17491923    0.68488647
## pH                   -0.196647602  0.20563251 -0.05773139   -0.67314051
## sulphates             1.000000000  0.09359475  0.25139708    0.15956033
## alcohol               0.093594750  1.00000000  0.47616632   -0.08426530
## quality               0.251397079  0.47616632  1.00000000    0.08570932
## total.acidity         0.159560329 -0.08426530  0.08570932    1.00000000

From the table above, I expect to see correlations between these pairs of features:

  • residual.sugar and density
  • free.sulfur.dioxide and total.sulfur.dioxide
  • pH and total.acidity
  • pH and citric.acid

The relationships between these features are plotted, and their R^2 values calculated, below.

## R^2 value of residual.sugar and density
## [1] 0.1262263

## R^2 value of free.sulfur.dioxide and total.sulfur.dioxide
## [1] 0.4457785

## R^2 value of total.acidity and pH
## [1] 0.4531181

## R^2 value of citric.acid and pH
## [1] 0.2936601

Observations:

  • free.sulfur.dioxide and total.sulfur.dioxide, and pH and total.acidity, are the most strongly related pairs, each with an R^2 of about 45%.
  • pH and citric.acid are moderately related, with an R^2 of 29%.
  • residual.sugar and density are weakly related, with an R^2 of 13%.
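For a regression with a single predictor, R^2 equals the square of the Pearson correlation, so the four R^2 values above can be cross-checked against the correlation table. A minimal sketch, with the coefficients copied from that table and the vector names chosen here for illustration:

```r
# Pearson correlations copied from the correlation table above
r <- c(
  "residual.sugar ~ density"                   =  0.35528337,
  "free.sulfur.dioxide ~ total.sulfur.dioxide" =  0.66766645,
  "total.acidity ~ pH"                         = -0.67314051,
  "citric.acid ~ pH"                           = -0.54190414
)

# With one predictor, R^2 is the squared correlation; the sign drops out
r_squared <- r^2
print(round(r_squared, 7))
```

This also shows why a correlation of about 0.67 in absolute value, which looks strong, still leaves more than half of the variance unexplained.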

Also from the correlation table above, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “chlorides”, “total.sulfur.dioxide”, “density”, “sulphates”, “alcohol” and “total.acidity” are more strongly correlated with quality than the remaining features, so I am going to draw a scatterplot matrix of these features.

Quality

  • No feature is strongly correlated with quality; the correlations are moderate at best.

fixed.acidity

  • has strong positive correlation with citric.acid and density.

volatile.acidity

  • has strong negative correlation with citric.acid.

citric.acid

  • has moderate positive correlation with chlorides, density and sulphates.

chlorides

  • has moderate positive correlation with density and sulphates.
  • has moderate negative correlation with alcohol.

total.sulfur.dioxide

  • has moderate negative correlation with alcohol.

density

  • has strong negative correlation with alcohol.
  • has moderate positive correlation with sulphates.

About these correlations, I want to look closer at scatter plots and box plots.

Quality

Findings from box plots:

  • Fixed acidity is almost the same across quality scores, except that it is slightly higher for wines scoring 7.
  • The more volatile acidity a wine has, the lower its quality score.
  • The more citric acid a wine has, the higher its quality score.
  • Median chlorides do not differ much between quality scores, but wines scoring 5 and 6 have more outliers above the upper quartile.
  • Total sulfur dioxide rises with quality for wines scoring 3 to 5, but falls with quality for wines scoring 5 to 8.
  • Density is mostly negatively correlated with quality, especially for wines scoring 7 and 8.
  • Sulphates are positively correlated with quality.
  • Alcohol is positively correlated with quality, especially for wines scoring above 6.

I calculated statistical scores of variables which might have correlation with quality.

Quality and volatile.acidity

## 
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(wineQualityReds, 
##     volatile.acidity <= quantile(wineQualityReds$volatile.acidity, 
##         0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.78977 -0.54547 -0.01325  0.47198  2.92568 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.55757    0.05841  112.27   <2e-16 ***
## volatile.acidity -1.74500    0.10503  -16.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7436 on 1596 degrees of freedom
## Multiple R-squared:  0.1474, Adjusted R-squared:  0.1469 
## F-statistic:   276 on 1 and 1596 DF,  p-value: < 2.2e-16

Quality and citric.acid

## 
## Call:
## lm(formula = quality ~ citric.acid, data = subset(wineQualityReds, 
##     citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.01809 -0.59820  0.09909  0.50922  2.59711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.37360    0.03371 159.384   <2e-16 ***
## citric.acid  0.97651    0.10144   9.627   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7847 on 1595 degrees of freedom
## Multiple R-squared:  0.05491,    Adjusted R-squared:  0.05432 
## F-statistic: 92.68 on 1 and 1595 DF,  p-value: < 2.2e-16

Quality and alcohol

## 
## Call:
## lm(formula = quality ~ alcohol, data = subset(wineQualityReds, 
##     alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8489 -0.4065 -0.1787  0.5176  2.5909 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.81782    0.17512   10.38   <2e-16 ***
## alcohol      0.36646    0.01672   21.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7083 on 1596 degrees of freedom
## Multiple R-squared:  0.2314, Adjusted R-squared:  0.2309 
## F-statistic: 480.4 on 1 and 1596 DF,  p-value: < 2.2e-16

From the result of R^2 scores,

  • Volatile.acidity explains about 15% of the variance in quality.
  • Citric acid explains about only 5% of the variance in quality.
  • Alcohol explains about 23% of the variance in quality.
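The pattern behind the three model summaries above (trim the predictor at its 99.9th percentile, then fit a simple linear regression) can be sketched as follows. The example uses only the first ten wines printed by str(), so its coefficients and R^2 will differ from the full-data results; the names wines, cutoff, trimmed and fit are illustrative.

```r
# First ten wines, values copied from the str() output above (illustration only)
wines <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Drop observations above the 99.9th percentile of the predictor
cutoff  <- quantile(wines$volatile.acidity, 0.999)
trimmed <- subset(wines, volatile.acidity <= cutoff)

# Simple linear regression of quality on volatile acidity
fit <- lm(quality ~ volatile.acidity, data = trimmed)

print(nobs(fit))                # rows used after trimming
print(coef(fit))                # intercept and slope
print(summary(fit)$r.squared)   # share of variance in quality explained
```

The trimming step explains the degrees of freedom in the full outputs: 1596 residual degrees of freedom with two estimated coefficients means 1598 rows were used, one fewer than the 1599 wines in the dataset.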

Summary of citric.acid by quality score:

## wine quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600
## wine quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000
## wine quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900
## wine quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800
## wine quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600
## wine quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Although I suspected that citric.acid is one of the causes of wine faults, the minimum citric.acid value is, against my intuition, highest at quality 8 (0.03, compared with 0 for every other score).
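A per-score breakdown like this can be produced with tapply(). The sketch below is one possible approach, not necessarily the one used in the report, and it runs on the first ten wines from the str() output only, so the groups are much smaller than in the full data.

```r
# First ten wines, values copied from the str() output above (illustration only)
wines <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One summary() per quality score
by_quality <- tapply(wines$citric.acid, wines$quality, summary)
print(by_quality)

# Minimum citric acid per quality score, for a direct comparison across groups
min_by_quality <- tapply(wines$citric.acid, wines$quality, min)
print(min_by_quality)
```

Comparing minima across groups is sensitive to group size: with only 18 wines at quality 8 in the full data, a non-zero minimum is less surprising than it would be in a group of several hundred.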

While 2 < quality < 6:

## 
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(wineQualityReds35, 
##     total.sulfur.dioxide <= quantile(wineQualityReds35$total.sulfur.dioxide, 
##         0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89308  0.03613  0.10220  0.14153  0.17457 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.8159953  0.0221026 217.892  < 2e-16 ***
## total.sulfur.dioxide 0.0015732  0.0003368   4.671 3.56e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3354 on 741 degrees of freedom
## Multiple R-squared:  0.0286, Adjusted R-squared:  0.02729 
## F-statistic: 21.82 on 1 and 741 DF,  p-value: 3.564e-06

While 5 < quality < 9:

## 
## Call:
## lm(formula = quality ~ total.sulfur.dioxide, data = subset(wineQualityReds68, 
##     total.sulfur.dioxide <= quantile(wineQualityReds68$total.sulfur.dioxide, 
##         0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3466 -0.3021 -0.2588  0.6578  1.8335 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.3597848  0.0302518 210.229  < 2e-16 ***
## total.sulfur.dioxide -0.0021961  0.0006457  -3.401 0.000702 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4883 on 852 degrees of freedom
## Multiple R-squared:  0.0134, Adjusted R-squared:  0.01224 
## F-statistic: 11.57 on 1 and 852 DF,  p-value: 0.0007016

When I split the dataset into two groups, one with quality scores 3-5 and the other with quality scores 6-8, total.sulfur.dioxide explains only about 3% of the variance in quality for wines scoring 3-5 and about 1% for wines scoring 6-8.

fixed.acidity

From the chart matrix, fixed.acidity seems to have strong positive correlation with citric.acid and density.

## 
## Call:
## lm(formula = fixed.acidity ~ citric.acid, data = subset(wineQualityReds, 
##     citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7891 -0.8217 -0.0324  0.8059  5.9591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6870     0.0553  120.92   <2e-16 ***
## citric.acid   6.0283     0.1664   36.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.287 on 1595 degrees of freedom
## Multiple R-squared:  0.4515, Adjusted R-squared:  0.4511 
## F-statistic:  1313 on 1 and 1595 DF,  p-value: < 2.2e-16

With R^2 = 0.45, citric.acid explains about 45% of the variance in fixed.acidity.

volatile.acidity

From the chart matrix, volatile.acidity seems to have strong negative correlation with citric.acid.

## 
## Call:
## lm(formula = volatile.acidity ~ citric.acid, data = subset(wineQualityReds, 
##     citric.acid <= quantile(wineQualityReds$citric.acid, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36880 -0.09851 -0.01599  0.07528  0.91314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.66686    0.00640  104.19   <2e-16 ***
## citric.acid -0.51458    0.01926  -26.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.149 on 1595 degrees of freedom
## Multiple R-squared:  0.3093, Adjusted R-squared:  0.3088 
## F-statistic: 714.1 on 1 and 1595 DF,  p-value: < 2.2e-16

Based on the R^2 value, citric.acid explains about 31% of the variance in volatile.acidity.

citric.acid

From the chart matrix, citric.acid seems to have moderate positive correlation with chlorides, density and sulphates.

## 
## Call:
## lm(formula = chlorides ~ citric.acid, data = subset(wineQualityReds, 
##     chlorides <= quantile(wineQualityReds$chlorides, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.07520 -0.01789 -0.00667  0.00479  0.36411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.076212   0.001832  41.603  < 2e-16 ***
## citric.acid 0.039227   0.005511   7.118 1.65e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04264 on 1595 degrees of freedom
## Multiple R-squared:  0.03079,    Adjusted R-squared:  0.03018 
## F-statistic: 50.67 on 1 and 1595 DF,  p-value: 1.646e-12

## 
## Call:
## lm(formula = density ~ citric.acid, data = subset(wineQualityReds, 
##     density <= quantile(wineQualityReds$density, 0.999)))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0073433 -0.0009327  0.0000387  0.0010650  0.0059090 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.9957862  0.0000747 13330.2   <2e-16 ***
## citric.acid 0.0035142  0.0002239    15.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001743 on 1595 degrees of freedom
## Multiple R-squared:  0.1338, Adjusted R-squared:  0.1333 
## F-statistic: 246.4 on 1 and 1595 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = sulphates ~ citric.acid, data = subset(wineQualityReds, 
##     sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30786 -0.09673 -0.02761  0.06082  1.29107 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.586730   0.006659   88.11   <2e-16 ***
## citric.acid 0.257854   0.020004   12.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1551 on 1595 degrees of freedom
## Multiple R-squared:  0.09434,    Adjusted R-squared:  0.09377 
## F-statistic: 166.1 on 1 and 1595 DF,  p-value: < 2.2e-16

Based on the R^2 values, citric.acid explains less than 10% of the variance in chlorides (about 3%) and in sulphates (about 9%), and about 13% of the variance in density.

chlorides

From the chart matrix, chlorides seems to have moderate positive correlation with density and sulphates and negative correlation with alcohol.

## 
## Call:
## lm(formula = chlorides ~ density, data = subset(wineQualityReds, 
##     density <= quantile(wineQualityReds$density, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04630 -0.01651 -0.00836  0.00103  0.52340 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.6723     0.6134  -7.617 4.42e-14 ***
## density       4.7752     0.6154   7.759 1.51e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04603 on 1595 degrees of freedom
## Multiple R-squared:  0.03637,    Adjusted R-squared:  0.03577 
## F-statistic: 60.21 on 1 and 1595 DF,  p-value: 1.514e-14

## 
## Call:
## lm(formula = chlorides ~ sulphates, data = subset(wineQualityReds, 
##     sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12513 -0.01901 -0.00440  0.00882  0.46687 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.025119   0.004422    5.68 1.59e-08 ***
## sulphates   0.094452   0.006538   14.45  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04255 on 1595 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.1152 
## F-statistic: 208.7 on 1 and 1595 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = chlorides ~ alcohol, data = subset(wineQualityReds, 
##     alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06279 -0.01865 -0.00872  0.00327  0.51344 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.190591   0.011351  16.791   <2e-16 ***
## alcohol     -0.009897   0.001084  -9.133   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04591 on 1596 degrees of freedom
## Multiple R-squared:  0.04966,    Adjusted R-squared:  0.04907 
## F-statistic: 83.41 on 1 and 1596 DF,  p-value: < 2.2e-16

Based on the R^2 values, density and alcohol each explain less than 10% of the variance in chlorides (about 4% and 5%), whereas sulphates explains about 12%.

total.sulfur.dioxide

From the chart matrix, total.sulfur.dioxide seems to have moderate negative correlation with alcohol.

## 
## Call:
## lm(formula = total.sulfur.dioxide ~ alcohol, data = subset(wineQualityReds, 
##     alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.008 -23.217  -8.064  13.936 254.729 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113.9790     7.9572   14.32   <2e-16 ***
## alcohol      -6.4804     0.7597   -8.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.18 on 1596 degrees of freedom
## Multiple R-squared:  0.0436, Adjusted R-squared:  0.043 
## F-statistic: 72.76 on 1 and 1596 DF,  p-value: < 2.2e-16

Based on the R^2 value, alcohol explains only about 4% of the variance in total.sulfur.dioxide.

density

From the chart matrix, density seems to have strong negative correlation with alcohol and moderate positive correlation with sulphates.

## 
## Call:
## lm(formula = density ~ alcohol, data = subset(wineQualityReds, 
##     alcohol <= quantile(wineQualityReds$alcohol, 0.999)))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0049662 -0.0010828 -0.0002425  0.0008610  0.0073845 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.0060274  0.0004043 2488.42   <2e-16 ***
## alcohol     -0.0008907  0.0000386  -23.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001635 on 1596 degrees of freedom
## Multiple R-squared:  0.2502, Adjusted R-squared:  0.2497 
## F-statistic: 532.5 on 1 and 1596 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = density ~ sulphates, data = subset(wineQualityReds, 
##     sulphates <= quantile(wineQualityReds$sulphates, 0.999)))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0065505 -0.0011168  0.0000013  0.0011520  0.0067538 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 0.9956368  0.0001941 5130.025  < 2e-16 ***
## sulphates   0.0016875  0.0002869    5.881 4.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001868 on 1595 degrees of freedom
## Multiple R-squared:  0.02122,    Adjusted R-squared:  0.02061 
## F-statistic: 34.59 on 1 and 1595 DF,  p-value: 4.956e-09

Based on the R^2 values, alcohol explains about 25% of the variance in density, but sulphates explains only about 2%.
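The chart matrix is read in terms of correlation, while the regression summaries report R^2. For a model with a single predictor, R^2 equals the squared Pearson correlation, and the correlation takes the sign of the slope. The sketch below converts the two R^2 values printed above; the helper name `r_from_fit` is hypothetical, and because the fits use trimmed subsets the results may differ slightly from the chart matrix.

```r
# For a one-predictor linear model, R^2 is the squared Pearson correlation,
# and the correlation takes the sign of the fitted slope.
r_from_fit <- function(r_squared, slope) {
  sign(slope) * sqrt(r_squared)
}

# Values copied from the two summaries above.
r_density_alcohol   <- r_from_fit(0.2502,  -0.0008907)
r_density_sulphates <- r_from_fit(0.02122,  0.0016875)

print(round(c(alcohol = r_density_alcohol, sulphates = r_density_sulphates), 2))
```

This gives roughly -0.50 for alcohol and 0.15 for sulphates, so the link between density and sulphates is weaker than the chart matrix first suggested.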

To sum up this section, although volatile.acidity, citric.acid, total.sulfur.dioxide and alcohol seemed to be correlated with quality in the boxplots, their R^2 values were not high. There might be too many outliers, or other factors may affect quality. Also, the minimum citric.acid was largest within the subset of wines with quality score 8. This was against my expectation that citric.acid adds an unpleasant taste to wine. Again, based on the R^2 value, the correlation between alcohol and density was unexpectedly high at 25%.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features
in the dataset?

From observation of box plots which include quality and other features’ data:

  • Fixed acidity is almost the same across all quality scores, except that it is slightly higher for wines with score 7.
  • The more volatile acidity a wine has, the lower its quality score.
  • The more citric acid a wine has, the higher its quality score.
  • The median of chlorides does not differ much between quality scores, but wines with quality score 5 and 6 have more outliers above the upper quartile.
  • The correlation between total sulfur dioxide and quality is positive for wines with quality scores from 3 to 5, but it turns negative for wines with quality scores from 5 to 8.
  • Density is mostly negatively correlated with quality, especially for wines with score 7 and 8.
  • Sulphates has a positive correlation with quality.
  • Alcohol is positively correlated with quality, especially for wines with score above 6.

From observation of the matrix chart of some attributes (fixed.acidity, volatile.acidity, citric.acid, chlorides, total.sulfur.dioxide, density, sulphates, alcohol, and total.acidity):

  • Fixed.acidity has strong positive correlation with citric.acid and density.
  • Volatile.acidity has strong negative correlation with citric.acid.
  • Citric.acid has moderate positive correlation with chlorides, density and sulphates.
  • Chlorides has moderate positive correlation with density and sulphates and moderate negative correlation with alcohol.
  • Total.sulfur.dioxide has moderate negative correlation with alcohol.
  • Density has strong negative correlation with alcohol.
  • Density has moderate positive correlation with sulphates.

From observation of a correlation table which includes all features:

  • “Free.sulfur.dioxide and total.sulfur.dioxide” and “pH and total.acidity” have strong correlations, with R^2 values of about 45%.
  • pH and citric.acid have a moderate correlation, with an R^2 value of 29%.
  • Residual.sugar and density have a weak correlation, with an R^2 value of 13%.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

In terms of R^2, the correlation between alcohol and density is unexpectedly high at about 25%. After I learned from an online search that the density of alcohol is about 0.8 g/ml, lower than that of water, this looks more natural.
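As a back-of-the-envelope illustration of why the density of alcohol makes a negative relationship natural, the sketch below treats wine as an ideal mix of water and ethanol by volume. This is a strong simplification (real wine also contains sugar, acids and other solutes, and ethanol and water do not mix ideally), and it assumes the alcohol column is a percentage by volume. The function name `mix_density` is hypothetical.

```r
# Assumption: wine is treated as an ideal mix of water and ethanol by volume.
water_density   <- 1.0   # g/ml
ethanol_density <- 0.8   # g/ml, the approximate figure quoted above

# Density of a mix with the given alcohol percentage by volume.
mix_density <- function(alcohol_pct) {
  v <- alcohol_pct / 100
  (1 - v) * water_density + v * ethanol_density
}

# Change in density per one percentage point of alcohol.
ideal_slope  <- mix_density(1) - mix_density(0)
fitted_slope <- -0.0008907   # from lm(density ~ alcohol) earlier in this report

cat("ideal mix slope:", ideal_slope, "fitted slope:", fitted_slope, "\n")
```

Both slopes are negative and of a similar order of magnitude (about -0.002 versus about -0.0009 per percentage point). The gap between them is plausibly due to the other components of wine that this simplified mix ignores.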

What was the strongest relationship you found?

Based on the R^2 values for explaining the variance in quality, alcohol explains the most, at about 23%. Although its R^2 is not as high as that of alcohol, volatile.acidity also explains about 15% of the variance in wine quality.

Multivariate Plots Section

From the bivariate plots, I expect to find more relationships between wine quality and other attributes: volatile.acidity, citric.acid, total.sulfur.dioxide, density, sulphates, and alcohol. Also, since alcohol seems to be one of the important factors for good wine, I would like to explore the relationships between alcohol and seemingly correlated attributes: chlorides, total.sulfur.dioxide and density.

Alcohol and chlorides, total.sulfur.dioxide and density

I created scatter plots of the combinations chlorides and density, total.sulfur.dioxide and density, and total.sulfur.dioxide and chlorides, applying a log transformation to data with skewed distributions.

From the plots, the chlorides and density combination seems to have the strongest correlation with alcohol among the three. I wonder how it looks when the data is divided by quality and outliers are removed.

After removing the density data above the third quartile and dividing by quality, it became clearer that the higher the density of a wine, the lower the percentage of alcohol it contains. Also, a wine with density below 0.9925 is likely to fall into quality score 6, and a wine with density above 1.0000 is likely to have quality score 5 or 6.

Quality and volatile.acidity, citric.acid, total.sulfur.dioxide, density,
sulphates, and alcohol

Observing the bivariate plots, I suspected that there are different trends among wines of low, middle and high quality. So I plotted combinations of volatile.acidity, citric.acid, total.sulfur.dioxide, density, sulphates, and alcohol, colored by quality.cut. From the univariate and bivariate analysis, I expected that citric.acid, sulphates and alcohol have a relatively positive correlation with quality and that volatile.acidity and density have a relatively negative correlation with quality. Also, the distribution of total.sulfur.dioxide over quality seems to be close to a normal distribution. Combining the positively and negatively correlated features with each other, I plotted scatter plots of “alcohol and citric.acid”, “sulphates and citric.acid”, “total.sulfur.dioxide and citric.acid”, “volatile.acidity and density” and “volatile.acidity and total.sulfur.dioxide”, colored by quality.cut. For skewed data, a log transformation is applied so that the tendency of the distribution becomes clearer. When plotting citric.acid, data with citric.acid == 0 is removed for easier observation of the correlation.
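The preparation steps described above (dropping rows with citric.acid == 0, log-transforming a skewed variable, and bucketing quality for coloring) can be sketched as follows. The example uses only the first ten rows shown in the str() output at the top of this report. The bucket boundaries for `quality_cut` are an assumption for illustration, since the report does not state how quality.cut was defined.

```r
# First ten rows, copied from the str() output at the top of this report.
wine_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed buckets (not taken from the report): 3-4 low, 5-6 middle, 7-8 high.
wine_head$quality_cut <- cut(wine_head$quality,
                             breaks = c(2, 4, 6, 8),
                             labels = c("low", "middle", "high"))

# Drop rows with citric.acid == 0, then log-transform the skewed sulphates.
plot_data <- subset(wine_head, citric.acid > 0)
plot_data$log10_sulphates <- log10(plot_data$sulphates)

print(plot_data)
```

With the full data set, the same steps would be applied to wineQualityReds before passing the result to the plotting code.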

Among the plots above, the plot of log10(sulphates) and citric.acid shows the clearest relationship between quality and the two attributes. Good wine tends to have more citric.acid and sulphates, appearing on the upper right.

Since the R^2 value for quality was highest for alcohol (23%) and volatile.acidity (15%), I also plotted a scatter plot of these two features.

Thus good wines are mainly grouped on the upper left, with a higher percentage of alcohol and lower volatile.acidity.

I would like to explore more about correlations between log10(sulphates) and citric.acid, and volatile.acidity and alcohol.

## 
## Calls:
## lm: lm(formula = log10.sulphates ~ citric.acid, data = wineQualityReds)
## 
## ================================
##   (Intercept)        -0.238***  
##                      (0.004)    
##   citric.acid         0.165***  
##                      (0.012)    
## --------------------------------
##   R-squared           0.110     
##   adj. R-squared      0.109     
##   sigma               0.092     
##   F                 197.186     
##   p                   0.000     
##   Log-likelihood   1553.693     
##   Deviance           13.409     
##   AIC             -3101.387     
##   BIC             -3085.255     
##   N                1599         
## ================================

It is interesting that the correlation turns negative only for wines with quality score 8, where the variance of log10(sulphates) is also smaller.

## 
## Calls:
## lm: lm(formula = volatile.acidity ~ alcohol, data = wineQualityReds)
## 
## ================================
##   (Intercept)         0.882***  
##                      (0.043)    
##   alcohol            -0.034***  
##                      (0.004)    
## --------------------------------
##   R-squared           0.041     
##   adj. R-squared      0.040     
##   sigma               0.175     
##   F                  68.138     
##   p                   0.000     
##   Log-likelihood    515.359     
##   Deviance           49.139     
##   AIC             -1024.718     
##   BIC             -1008.587     
##   N                1599         
## ================================

There is a strong correlation between alcohol and volatile.acidity for data points with quality == 3 or 8.
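A per-quality comparison like the one above can be computed by splitting the data by quality and calculating the correlation within each group. The sketch below uses only the first ten rows shown in the str() output at the top of this report, so it contains just scores 5, 6 and 7 and is not expected to reproduce the pattern for scores 3 and 8. The helper name `group_cor` and the minimum group size of 3 are assumptions.

```r
# First ten rows, copied from the str() output at the top of this report.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlation of alcohol and volatile.acidity within one quality group.
# Groups with fewer than 3 rows return NA, since r is not meaningful there.
group_cor <- function(d) {
  if (nrow(d) < 3) return(NA_real_)
  cor(d$alcohol, d$volatile.acidity)
}

cor_by_quality <- sapply(split(wine_head, wine_head$quality), group_cor)
print(cor_by_quality)
```

Running the same function over wineQualityReds would give one correlation per quality score, which makes it easier to compare scores 3 and 8 with the middle scores.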

To summarize,

  • From the plot of log10(chlorides) and density,
    • density and alcohol are correlated.
    • log10(chlorides) and alcohol are weakly correlated.
    • data points with density below 0.9925 tend to be good wines scoring 6 to 8.
    • data points with density above 1.0000 tend to be wines scoring 5 or 6.
  • From the plot of log10(sulphates) and citric.acid,
    • good wine tends to appear on the upper right of the plot, with more citric.acid and sulphates.
    • the correlation is positive for data points with quality score 3 to 7, but turns negative for data points with quality == 8.
  • From the plot of volatile.acidity and alcohol,
    • good wine tends to appear on the upper left, with a higher percentage of alcohol and lower volatile.acidity.
    • the correlation is more obvious for data with quality score 3 or 8.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Volatile.acidity and alcohol seem to be the best combination for looking at wine quality. In the plot of volatile.acidity and alcohol, good wines are mainly positioned on the upper left, with a higher percentage of alcohol and lower volatile.acidity. The plot of log10(sulphates) and citric.acid also shows that the two are correlated with each other. Here good wines are in the upper right corner, with larger log10(sulphates) and more citric.acid.

Were there any interesting or surprising interactions between features?

On the plot of log10(chlorides) and density, after removing the density data above the third quartile and dividing by quality, it became clearer that the higher the density of a wine, the lower the percentage of alcohol it contains. Also, a wine with density below 0.9925 is likely to fall into quality score 6, 7 or 8, and a wine with density above 1.0000 is likely to have quality score 5 or 6. It is interesting that, although there is no strong correlation, a wine with lower density tends to be a good one.


Final Plots and Summary

Plot One

Description One

This plot indicates a correlation between alcohol and quality. Also, when density is below 0.9925 or above 1.0000, quality seems to be 5 or above.

Plot Two

Description Two

This plot shows the correlation of quality with citric.acid and log10(sulphates).

Plot Three

Description Three

This plot shows the correlation of quality with alcohol and volatile.acidity and, at the same time, a stronger correlation between alcohol and volatile.acidity at scores 3 and 8.


Reflection

This red wine data set contains 1599 observations with 13 attributes. Not being familiar with wine tasting, I had to start by understanding what each attribute means and how it affects the taste of wine. In the end, there seemed to be no one-to-one relationship between wine taste and any single attribute. As the amounts of the attributes vary, wine quality also varies non-linearly. That is why I used a matrix chart to find the attributes that are relatively correlated with quality. From this chart, I could find that alcohol, sulphates, density, total.sulfur.dioxide, citric.acid and volatile.acidity had a relatively strong relationship with quality. From the online articles I read, I had misunderstood that citric.acid is a fault in wine taste, so it felt strange when I found a positive correlation between quality and citric.acid. I searched again and then correctly understood that citric.acid becomes a flaw in wine taste only when it is present in excessive amounts. As for future work, since wine taste seems to be composed of several features and the balance among them, it would be possible to investigate further the appropriate composition and balance to predict a good wine.

Citation

[1] Kaggle. Red Wine Dataset. Retrieved from https://www.kaggle.com/piyushgoyal443/red-wine-dataset#wineQualityInfo.txt
[2] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Retrieved from http://dx.doi.org/10.1016/j.dss.2009.05.016
[3] ScienceDirect. Wine Quality. Retrieved from https://www.sciencedirect.com/topics/food-science/wine-quality
[4] Wikipedia. Acids in wine. Retrieved from https://en.wikipedia.org/wiki/Acids_in_wine
[5] Wikipedia. Wine fault. Retrieved from https://en.wikipedia.org/wiki/Wine_fault#Acetic_acid
[6] Wine in Moderation.com. How many grams of alcohol in wine?. Retrieved from https://www.wineinmoderation.eu/en/articles/How-many-grams-of-alcohol-in-wine.154/